CLAIMS 



WHAT IS CLAIMED IS: 

1. A method for predicting a polyadenylation site comprising:

inputting a plurality of RNA transcript sequences or sequences derived from RNA
transcript sequences, wherein at least one sequence has its poly A or poly T tract sequence;

searching for a polyadenylation site, wherein the polyadenylation site is an adenine rich
region at the end of the sequence or a thymine rich region at the beginning of the sequence; and

detecting the presence of polyadenylation signals neighboring the polyadenylation site
by scanning the EST or RNA sequences or their corresponding genomic DNA sequences.

2. The method of Claim 1 wherein the step of searching for a polyadenylation site
comprises scanning the sequences for an adenine rich region at the end of the sequence
or a thymine rich region at the beginning of the sequence.

3. The method of Claim 2 wherein the adenine rich region comprises adenine in at least 
50% of the region and the thymine rich region comprises thymine in at least 50% of 
the region. 

4. The method of Claim 2 wherein the adenine rich region comprises adenine in at least 
60% of the region and the thymine rich region comprises thymine in at least 60% of 
the region. 

5. The method of Claim 2 wherein the adenine rich region comprises adenine in at least 
70% of the region and the thymine rich region comprises thymine in at least 70% of 
the region.

6. The method of Claim 2 wherein the adenine rich region comprises adenine in at least
80% of the region and the thymine rich region comprises thymine in at least 80% of
the region.
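
By way of non-limiting illustration, the percentage thresholds recited in Claims 3 to 6 can be sketched in Python as follows. The function names, the window length of 20 bases, and the default threshold of 0.8 are assumptions made for this example only; the claims do not fix the length of the region.

```python
def tail_fraction(seq: str, base: str, window: int, from_end: bool = True) -> float:
    """Fraction of `base` in a window at the end (or the beginning) of `seq`."""
    if window <= 0:
        raise ValueError("window must be positive")
    seq = seq.upper()
    # Take the last `window` bases for a poly A check, the first for a poly T check.
    region = seq[-window:] if from_end else seq[:window]
    if not region:
        return 0.0
    return region.count(base) / len(region)


def is_rich(seq: str, threshold: float = 0.8, window: int = 20) -> bool:
    """True if the end window is adenine rich or the start window is thymine rich."""
    return (
        tail_fraction(seq, "A", window, from_end=True) >= threshold
        or tail_fraction(seq, "T", window, from_end=False) >= threshold
    )
```

Passing a threshold of 0.5, 0.6, 0.7 or 0.8 corresponds to the 50%, 60%, 70% and 80% alternatives, respectively.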

7. The method of Claim 1 wherein a heuristic score n_A / (n_A + 0.5*(max(n_R-20,0))) is
used for detecting an adenine or thymine rich region, wherein n_A is the number of
adenines or thymines in the block, and n_R is the number of bases after the block of
adenines or thymines to the end of the sequence.
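
By way of non-limiting illustration, the heuristic score recited above can be computed as sketched below. The function and parameter names are hypothetical, and returning 0 for an empty block is an assumption added here to avoid a division by zero.

```python
def heuristic_score(n_a: int, n_r: int) -> float:
    """Score a block of adenines (or thymines).

    n_a: number of adenines or thymines in the block.
    n_r: number of bases after the block, up to the end of the sequence.
    """
    if n_a <= 0:
        # Assumption: an empty block scores 0 rather than raising an error.
        return 0.0
    # Bases beyond the first 20 that follow the block are penalized at half weight.
    return n_a / (n_a + 0.5 * max(n_r - 20, 0))
```

Under this sketch the score stays at 1 while at most 20 bases follow the block, and decreases as more bases follow it.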

8. A method for detecting a polyadenylation signal in a sequence with a polyadenylation
site comprising searching for a polyadenylation signal hexamer in the sequence
before the polyadenylation site.

9. The method of Claim 8 wherein the searching comprises evaluating the probability
that there is a polyadenylation site: Pr(h = k | x) for k = 6, 7, ..., N, wherein the sequence
before the polyadenylation site is x = (x_1, x_2, ..., x_N) and where x_N is the 3'-most base
before the polyadenylation site.

10. The method of Claim 9 wherein: Pr(h = k | x) = Pr(x | h = k) Pr(h = k) / Pr(x).

11. The method of Claim 10 wherein
Pr(h = k | x) = Pr(x_{k-5}, ..., x_k | h = k) Pr(h = k) / Pr(x_{k-5}, ..., x_k)
and wherein Pr(h = k) is the probability that the polyadenylation hexamer is located at
position k in the sequence, at a distance (N - k) from the polyadenylation site,
Pr(x_{k-5}, ..., x_k | h = k) is the probability of observing the hexamer (x_{k-5}, ..., x_k)
given that it is a polyadenylation signal and Pr(x_{k-5}, ..., x_k | h ≠ k) is the probability
of observing the hexamer given that it is not from a polyadenylation signal.
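
By way of non-limiting illustration, the posterior probability for a single candidate position can be sketched as below. The function and parameter names are hypothetical. Expanding the denominator by the law of total probability, using the signal and non-signal hexamer probabilities, is an assumption of this sketch; the claim itself only recites the ratio.

```python
def hexamer_posterior(p_hex_given_signal: float,
                      p_hex_given_background: float,
                      prior: float) -> float:
    """Posterior probability that the hexamer ending at position k is the signal.

    p_hex_given_signal:     probability of the hexamer given that it is a signal.
    p_hex_given_background: probability of the hexamer given that it is not a signal.
    prior:                  prior probability that the signal is at position k.
    """
    if not 0.0 <= prior <= 1.0:
        raise ValueError("prior must lie in [0, 1]")
    joint_signal = p_hex_given_signal * prior
    # Assumption: the marginal hexamer probability is the two-component mixture.
    marginal = joint_signal + p_hex_given_background * (1.0 - prior)
    if marginal == 0.0:
        return 0.0
    return joint_signal / marginal
```

A hexamer that is common among signals and rare in background sequence therefore receives a posterior well above its prior.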

12. The method of Claim 11 wherein the step of detecting comprises using a gamma
function to produce a density which places the majority of its weight on the positions
located 5 to 25 bases distant from the polyadenylation site.
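
By way of non-limiting illustration, a gamma-shaped density over the distance between the hexamer and the polyadenylation site can be sketched as below. The shape and scale values (4.0 and 4.0) are assumptions chosen so that the density peaks 12 bases from the site; no parameter values are recited in the claims. Normalizing over the allowed positions k = 6, ..., N is likewise an assumption of this sketch.

```python
import math


def gamma_pdf(d: float, shape: float, scale: float) -> float:
    """Gamma probability density evaluated at distance d (0 for d <= 0)."""
    if d <= 0:
        return 0.0
    return d ** (shape - 1) * math.exp(-d / scale) / (math.gamma(shape) * scale ** shape)


def position_prior(n: int, shape: float = 4.0, scale: float = 4.0) -> dict:
    """Return {k: prior probability} for k = 6..n, where the distance is n - k."""
    weights = {k: gamma_pdf(n - k, shape, scale) for k in range(6, n + 1)}
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("sequence too short to hold a hexamer before the site")
    # Normalize so the prior sums to 1 over the allowed positions.
    return {k: w / total for k, w in weights.items()}
```

With these assumed parameters and a 60-base sequence, most of the prior weight falls on positions 5 to 25 bases from the site.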

13. The method of Claim 12 wherein Pr(x_{k-5}, ..., x_k | h ≠ k), the probability of observing the
hexamer given that it is not from a polyadenylation signal, is modeled using a
second-order Markov model trained on data collected from human 3' UTRs.

14. The method of Claim 13 wherein
Pr(x_{k-5}, ..., x_k | h ≠ k) = Pr(x_{k-5}) Pr(x_{k-4} | x_{k-5}) Pr(x_{k-3} | x_{k-5}, x_{k-4})
Pr(x_{k-2} | x_{k-4}, x_{k-3}) Pr(x_{k-1} | x_{k-3}, x_{k-2}) Pr(x_k | x_{k-2}, x_{k-1}),
wherein the first term is a zero-order Markovian probability, the second is a first-order
Markovian probability and the remaining four terms are second-order Markovian probabilities.
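
By way of non-limiting illustration, the six-term product recited above can be evaluated as sketched below. The function name and the layout of the three lookup tables (p0, p1, p2) are assumptions of this example; how the tables are populated is outside the scope of the sketch.

```python
from typing import Mapping


def background_hexamer_prob(hexamer: str,
                            p0: Mapping[str, float],
                            p1: Mapping[str, Mapping[str, float]],
                            p2: Mapping[str, Mapping[str, float]]) -> float:
    """Probability of a hexamer under a second-order Markov background model.

    p0[b]         zero-order probability of base b.
    p1[a][b]      first-order probability of b given the preceding base a.
    p2[ab][c]     second-order probability of c given the two preceding bases ab.
    """
    if len(hexamer) != 6:
        raise ValueError("expected a sequence of exactly six bases")
    h = hexamer.upper()
    # First base: zero-order term; second base: first-order term.
    prob = p0[h[0]] * p1[h[0]][h[1]]
    # Remaining four bases: each is conditioned on the two bases before it.
    for i in range(2, 6):
        prob *= p2[h[i - 2:i]][h[i]]
    return prob
```

With uniform tables every hexamer receives probability 0.25 to the sixth power, which is a convenient sanity check.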

15. The method of Claim 14 wherein, for a k-th order Markov model, the probability of
base b following a word w of length k is estimated by the frequency of the
concatenated word (wb) divided by the frequency of the word w, where frequencies
are computed from the training dataset of 3' UTR sequences.

16. The method of Claim 15 wherein, for the case k=0 (a zero-order Markovian model),
the probability of base b is estimated by its frequency in the dataset divided by the
size of the dataset.
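
By way of non-limiting illustration, the frequency estimates of Claims 15 and 16 can be sketched as below. The function names are hypothetical, the two short training sequences in the usage note are invented for the example, and returning 0 for an unseen word is an assumption of this sketch.

```python
from collections import Counter
from typing import Iterable, List


def count_words(sequences: Iterable[str], length: int) -> Counter:
    """Count every overlapping word of the given length across all sequences."""
    counts: Counter = Counter()
    for s in sequences:
        s = s.upper()
        for i in range(len(s) - length + 1):
            counts[s[i:i + length]] += 1
    return counts


def transition_prob(sequences: List[str], word: str, base: str) -> float:
    """Estimate the probability of `base` following `word` from word frequencies."""
    k = len(word)
    if k == 0:
        # Zero-order case: frequency of the base divided by the dataset size.
        total = sum(len(s) for s in sequences)
        return count_words(sequences, 1)[base] / total if total else 0.0
    numerator = count_words(sequences, k + 1)[word + base]
    denominator = count_words(sequences, k)[word]
    # Assumption: a word never seen in training yields probability 0.
    return numerator / denominator if denominator else 0.0
```

For the toy training set ["AATAAA", "AATA"], the word AA occurs four times and AAT twice, so the estimated probability of T following AA is 0.5.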

17. A computer readable medium comprising computer-executable instructions for
performing the method comprising:

inputting a plurality of RNA transcript sequences or sequences derived from RNA
transcript sequences, wherein at least one sequence has its poly A or poly T tract sequence;

searching for a polyadenylation site, wherein the polyadenylation site is an adenine rich
region at the end of the sequence or a thymine rich region at the beginning of the sequence; and

detecting the presence of polyadenylation signals neighboring the polyadenylation site
by scanning the EST or RNA sequences or their corresponding genomic DNA sequences.

18. The computer readable medium of Claim 17 wherein the step of searching for a
polyadenylation site comprises scanning the sequences for an adenine rich region at the
end of the sequence or a thymine rich region at the beginning of the sequence.

19. The computer readable medium of Claim 18 wherein the adenine rich region
comprises adenine in at least 50% of the region and the thymine rich region
comprises thymine in at least 50% of the region.

20. The computer readable medium of Claim 19 wherein the adenine rich region
comprises adenine in at least 60% of the region and the thymine rich region
comprises thymine in at least 60% of the region.

21. The computer readable medium of Claim 20 wherein the adenine rich region
comprises adenine in at least 70% of the region and the thymine rich region
comprises thymine in at least 70% of the region.

22. The computer readable medium of Claim 21 wherein the adenine rich region
comprises adenine in at least 80% of the region and the thymine rich region
comprises thymine in at least 80% of the region.

23. The computer readable medium of Claim 17 wherein a heuristic score n_A / (n_A +
0.5*(max(n_R-20,0))) is used for detecting an adenine or thymine rich region, wherein
n_A is the number of adenines or thymines in the block, and n_R is the number of bases
after the block of adenines or thymines to the end of the sequence.

24. A computer readable medium comprising computer-executable instructions for
performing the method comprising: searching for a polyadenylation signal hexamer
in the sequence before the polyadenylation site.

25. The computer readable medium of Claim 24 wherein the searching comprises
evaluating the probability that there is a polyadenylation site: Pr(h = k | x) for k =
6, 7, ..., N, wherein the sequence before the polyadenylation site is x = (x_1, x_2, ..., x_N) and
where x_N is the 3'-most base before the polyadenylation site.

26. The computer readable medium of Claim 25 wherein: Pr(h = k | x) = Pr(x | h = k)
Pr(h = k) / Pr(x).

27. The computer readable medium of Claim 26 wherein:
Pr(h = k | x) = Pr(x_{k-5}, ..., x_k | h = k) Pr(h = k) / Pr(x_{k-5}, ..., x_k)
and wherein Pr(h = k) is the probability that the polyadenylation
hexamer is located at position k in the sequence, at a distance (N - k) from the
polyadenylation site, Pr(x_{k-5}, ..., x_k | h = k) is the probability of observing the hexamer
(x_{k-5}, ..., x_k) given that it is a polyadenylation signal and Pr(x_{k-5}, ..., x_k | h ≠ k) is the
probability of observing the hexamer given that it is not from a polyadenylation
signal.

28. The computer readable medium of Claim 27 wherein the step of detecting comprises
using a gamma function to produce a density which places the majority of its weight
on the positions located 5 to 25 bases distant from the polyadenylation site.

29. The computer readable medium of Claim 28 wherein Pr(x_{k-5}, ..., x_k | h ≠ k), the
probability of observing the hexamer given that it is not from a polyadenylation
signal, is modeled using a second-order Markov model trained on data collected from
human 3' UTRs.

30. The computer readable medium of Claim 29 wherein
Pr(x_{k-5}, ..., x_k | h ≠ k) = Pr(x_{k-5}) Pr(x_{k-4} | x_{k-5}) Pr(x_{k-3} | x_{k-5}, x_{k-4})
Pr(x_{k-2} | x_{k-4}, x_{k-3}) Pr(x_{k-1} | x_{k-3}, x_{k-2}) Pr(x_k | x_{k-2}, x_{k-1}), wherein
the first term is a zero-order Markovian probability, the second is a first-order
Markovian probability and the remaining four terms are second-order Markovian
probabilities.

31. The computer readable medium of Claim 30 wherein, for a k-th order Markov model,
the probability of base b following a word w of length k is estimated by the
frequency of the concatenated word (wb) divided by the frequency of the word w,
where frequencies are computed from the training dataset of 3' UTR sequences.

32. The computer readable medium of Claim 31 wherein, for the case k=0 (a zero-order
Markovian model), the probability of base b is estimated by its frequency in the
dataset divided by the size of the dataset.

33. A system comprising: a processor; and a memory coupled with the processor, the
memory storing a plurality of machine instructions that cause the processor to
perform logical steps of the method comprising:

inputting a plurality of RNA transcript sequences or sequences derived from RNA
transcript sequences, wherein at least one sequence has its poly A or poly T tract sequence;

searching for a polyadenylation site, wherein the polyadenylation site is an adenine rich
region at the end of the sequence or a thymine rich region at the beginning of the sequence; and

detecting the presence of polyadenylation signals neighboring the polyadenylation site
by scanning the EST or RNA sequences or their corresponding genomic DNA sequences.

34. The system of Claim 33 wherein the step of searching for a polyadenylation site
comprises scanning the sequences for an adenine rich region at the end of the sequence
or a thymine rich region at the beginning of the sequence.

35. The system of Claim 34 wherein the adenine rich region comprises adenine in at least 
50% of the region and the thymine rich region comprises thymine in at least 50% of 
the region.

36. The system of Claim 35 wherein the adenine rich region comprises adenine in at least
60% of the region and the thymine rich region comprises thymine in at least 60% of
the region.

37. The system of Claim 36 wherein the adenine rich region comprises adenine in at least
70% of the region and the thymine rich region comprises thymine in at least 70% of
the region.

38. The system of Claim 37 wherein the adenine rich region comprises adenine in at least 
80% of the region and the thymine rich region comprises thymine in at least 80% of 
the region. 

39. The system of Claim 33 wherein a heuristic score n_A / (n_A + 0.5*(max(n_R-20,0))) is
used for detecting an adenine or thymine rich region, wherein n_A is the number of
adenines or thymines in the block, and n_R is the number of bases after the block of
adenines or thymines to the end of the sequence.

40. A system comprising: a processor; and a memory coupled with the processor, the
memory storing a plurality of machine instructions that cause the processor to
perform logical steps of the method for detecting a polyadenylation signal in a
sequence with a polyadenylation site comprising: searching for a polyadenylation
signal hexamer in the sequence before the polyadenylation site.

41. The system of Claim 40 wherein the searching comprises evaluating the probability
that there is a polyadenylation site: Pr(h = k | x) for k = 6, 7, ..., N, wherein the sequence
before the polyadenylation site is x = (x_1, x_2, ..., x_N) and where x_N is the 3'-most base
before the polyadenylation site.

42. The system of Claim 41 wherein: Pr(h = k | x) = Pr(x | h = k) Pr(h = k) / Pr(x).

43. The system of Claim 42 wherein
Pr(h = k | x) = Pr(x_{k-5}, ..., x_k | h = k) Pr(h = k) / Pr(x_{k-5}, ..., x_k)
and wherein Pr(h = k) is the probability that the polyadenylation hexamer is located at
position k in the sequence, at a distance (N - k) from the polyadenylation site,
Pr(x_{k-5}, ..., x_k | h = k) is the probability of observing the hexamer (x_{k-5}, ..., x_k)
given that it is a polyadenylation signal and Pr(x_{k-5}, ..., x_k | h ≠ k) is the probability
of observing the hexamer given that it is not from a polyadenylation signal.

44. The system of Claim 43 wherein the step of detecting comprises using a gamma
function to produce a density which places the majority of its weight on the positions
located 5 to 25 bases distant from the polyadenylation site.

45. The system of Claim 44 wherein Pr(x_{k-5}, ..., x_k | h ≠ k), the probability of observing the
hexamer given that it is not from a polyadenylation signal, is modeled using a
second-order Markov model trained on data collected from human 3' UTRs.

46. The system of Claim 45 wherein
Pr(x_{k-5}, ..., x_k | h ≠ k) = Pr(x_{k-5}) Pr(x_{k-4} | x_{k-5}) Pr(x_{k-3} | x_{k-5}, x_{k-4})
Pr(x_{k-2} | x_{k-4}, x_{k-3}) Pr(x_{k-1} | x_{k-3}, x_{k-2}) Pr(x_k | x_{k-2}, x_{k-1}),
wherein the first term is a zero-order Markovian probability, the second is a first-order
Markovian probability and the remaining four terms are second-order Markovian probabilities.

47. The system of Claim 46 wherein, for a k-th order Markov model, the probability of
base b following a word w of length k is estimated by the frequency of the
concatenated word (wb) divided by the frequency of the word w, where frequencies
are computed from the training dataset of 3' UTR sequences.

48. The system of Claim 47 wherein, for the case k=0 (a zero-order Markovian model),
the probability of base b is estimated by its frequency in the dataset divided by the
size of the dataset.